Real / Fake Job Post Prediction

  • Dina Bishr - 900181303
  • Mariam Daabis - 900182265

Problem Statement

As a result of the pandemic, there has been a significant increase in the number of jobs offered online on various employment portals. It has been reported that not all job postings are legitimate, which undermines the credibility of the websites that host them. Fake postings are also a security threat, since scammers can use applicants' information to steal their identities. Using deep learning techniques, we aim to predict whether a job posting is real or fake so that fraudulent posts can be filtered out early on.

Dataset

Only one dataset is publicly available for this problem. It was compiled by the University of the Aegean, is hosted on Kaggle, and has been used in many models and research papers.

Dataset Name: Dataset of real and fake job postings

Link: https://www.kaggle.com/shivamb/real-or-fake-fake-jobposting-prediction

Size: 50MB

This dataset includes almost 18 thousand job posts, 866 of which are fake. It consists of 18 columns, including the job title, location, company profile, description, requirements, benefits, and a binary fraudulent label.



Input/Output Examples

Our input is a string containing the job description, and the output is whether the job post is fraudulent or not.
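
For illustration, a hypothetical input/output pair (the posting text below is invented, not taken from the dataset):

```python
# Illustrative input/output pair (made-up posting, not from the dataset).
job_description = (
    "Work from home! Earn $5,000 per week with no experience required. "
    "Send us your personal details to get started immediately."
)

# The model maps the description string to a binary label:
# 0 = legitimate posting, 1 = fraudulent posting.
prediction = 1  # fraudulent
```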



State of the art

BERT : Bidirectional Encoder Representations from Transformers

One of the most powerful NLP models in current use. Unlike earlier unidirectional language models, BERT reads text in both directions at once, conditioning each token's representation on both its left and right context.

It also comes with models pre-trained on large amounts of text, which facilitates downstream tasks such as semantic labeling and, more importantly for our case, sentence classification.
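
A minimal sketch of BERT sentence classification with the Hugging Face transformers library (this shows the typical usage pattern, not the exact model or weights used in any study cited here):

```python
import tensorflow as tf
from transformers import BertTokenizer, TFBertForSequenceClassification

# Load a pre-trained BERT with a 2-way classification head (real vs. fake).
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = TFBertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

# Tokenize a job description and run it through the model.
inputs = tokenizer(
    "Earn money fast working from home, no experience needed!",
    return_tensors="tf", truncation=True, padding=True,
)
logits = model(**inputs).logits
pred = int(tf.argmax(logits, axis=-1)[0])  # 0 = real, 1 = fake (head is untrained)
```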



Below are the state-of-the-art results achieved by several models on the same dataset we are using:


Original Model from Literature

The original model used in the literature was a Bi-LSTM.



Our original model code consisted of a stack of Keras layers built around this Bi-LSTM; a minimal sketch follows.
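
A minimal Keras sketch of such a Bi-LSTM classifier (layer sizes and hyperparameters here are illustrative assumptions, not the exact original configuration):

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Illustrative hyperparameters (assumptions, not the exact original values).
VOCAB_SIZE, EMBED_DIM, MAX_LEN = 20000, 100, 200

model = models.Sequential([
    layers.Embedding(VOCAB_SIZE, EMBED_DIM, input_length=MAX_LEN),
    layers.Bidirectional(layers.LSTM(64)),
    layers.Dense(32, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(1, activation="sigmoid"),  # fraudulent vs. real
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```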


Proposed Updates

Update #1: Applied oversampling to the dataset
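
A sketch of one common way to do this, upsampling the minority (fraudulent) class with scikit-learn's resample until the classes balance (the file and column names follow the Kaggle dataset; the exact pipeline details are assumptions):

```python
import pandas as pd
from sklearn.utils import resample

df = pd.read_csv("fake_job_postings.csv")  # Kaggle dataset file
real = df[df["fraudulent"] == 0]
fake = df[df["fraudulent"] == 1]

# Sample the fake class with replacement until it matches the real class size,
# then shuffle the combined frame.
fake_upsampled = resample(fake, replace=True, n_samples=len(real), random_state=42)
balanced = pd.concat([real, fake_upsampled]).sample(frac=1, random_state=42)
print(balanced["fraudulent"].value_counts())
```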

Update #2: Introduced tokenization and sequence padding as inputs to the model layers
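
A sketch of this preprocessing step with the Keras Tokenizer and pad_sequences (vocabulary size and sequence length are illustrative assumptions):

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

texts = [
    "Marketing intern needed for a fast-growing startup.",
    "Earn $5000 weekly from home, no experience required!",
]

# Build the vocabulary and map each text to a sequence of integer token ids.
tokenizer = Tokenizer(num_words=20000, oov_token="<OOV>")
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)

# Pad/truncate every sequence to a fixed length so it can feed the model.
X = pad_sequences(sequences, maxlen=200, padding="post", truncating="post")
```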


Update #3: Added Batch Normalization and GlobalMaxPooling1D layers
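
A sketch of how these layers slot into the Bi-LSTM model (placement and sizes are illustrative; note that GlobalMaxPooling1D requires the LSTM to return the full sequence):

```python
import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Embedding(20000, 100, input_length=200),
    layers.Bidirectional(layers.LSTM(64, return_sequences=True)),  # keep timesteps
    layers.BatchNormalization(),   # stabilize activations between layers
    layers.GlobalMaxPooling1D(),   # reduce (timesteps, features) -> (features,)
    layers.Dense(32, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```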


Results

We achieved an accuracy of 96%; however, oversampling did not solve the overfitting issue.



Technical report

  • Programming framework: TensorFlow
  • Training hardware: Google Colab
  • Training time: 3 hours
  • Number of epochs: 3
  • Time per epoch: 1 hour

Conclusion

Not every technique intended to solve overfitting actually works: we tried oversampling the data, and the model still overfit.

Since the model takes a long time to train, it was very difficult to do extensive hyper-parameter tuning or to experiment with variations in the model architecture.

GloVe embeddings, text cleaning, and NLP preprocessing in general had a strong positive effect on the results, so we highly recommend them.
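
For reference, a sketch of how pre-trained GloVe vectors are typically loaded into a Keras embedding matrix (the file name and dimensions assume the public glove.6B download; tokenizer is a fitted Keras Tokenizer as in Update #2):

```python
import numpy as np

VOCAB_SIZE, EMBED_DIM = 20000, 100

# Parse the GloVe text file into a word -> vector lookup.
embeddings_index = {}
with open("glove.6B.100d.txt", encoding="utf-8") as f:
    for line in f:
        values = line.split()
        embeddings_index[values[0]] = np.asarray(values[1:], dtype="float32")

# Fill an embedding matrix indexed by the tokenizer's word ids;
# words missing from GloVe keep a zero vector.
embedding_matrix = np.zeros((VOCAB_SIZE, EMBED_DIM))
for word, i in tokenizer.word_index.items():
    if i < VOCAB_SIZE and word in embeddings_index:
        embedding_matrix[i] = embeddings_index[word]

# Usage: layers.Embedding(VOCAB_SIZE, EMBED_DIM,
#                         weights=[embedding_matrix], trainable=False)
```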